Feature generation and representations for protein-protein interaction classification
Authors
Abstract
Automatically detecting articles relevant to protein-protein interaction (PPI) is a crucial step in large-scale biological database curation. Previous work adopted POS tagging, shallow parsing, and sentence splitting techniques, but these performed worse than the simple bag-of-words representation. In this paper, we generated and investigated multiple types of feature representations in order to further improve the performance of the PPI text classification task. Besides the traditional domain-independent bag-of-words approach and term weighting methods, we also explored domain-dependent features, i.e. protein-protein interaction trigger keywords, protein named entities, and advanced ways of incorporating Natural Language Processing (NLP) output. The integration of these multiple features was evaluated on the BioCreAtIvE II corpus. The experimental results showed that both the advanced way of using NLP output and the integration of bag-of-words and NLP output improved text classification performance. Specifically, in comparison with the best performance achieved in the BioCreAtIvE II IAS, the feature-level and classifier-level integration of multiple features improved classification performance by 2.71% and 3.95%, respectively.
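The difference between feature-level and classifier-level integration mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the trigger keyword list, the function names, and the weighted-average combination rule are all assumptions chosen for illustration, since the abstract does not specify them.

```python
from collections import Counter

# Hypothetical trigger keywords; the paper's actual list is not given in the abstract.
TRIGGER_KEYWORDS = {"interacts", "binds", "phosphorylates", "complex"}


def tokenize(text):
    """Lowercase, split on whitespace, and strip simple punctuation."""
    stripped = (tok.strip(".,;:()").lower() for tok in text.split())
    return [tok for tok in stripped if tok]


def bag_of_words(tokens, vocabulary):
    """Domain-independent feature: term counts over a fixed vocabulary."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


def trigger_features(tokens):
    """Domain-dependent feature: number of trigger keyword occurrences."""
    return [sum(1 for tok in tokens if tok in TRIGGER_KEYWORDS)]


def feature_level_integration(tokens, vocabulary):
    """Concatenate the feature vectors so one classifier sees all of them."""
    return bag_of_words(tokens, vocabulary) + trigger_features(tokens)


def classifier_level_integration(scores, weights=None):
    """Combine scores from separately trained classifiers.

    A weighted average is assumed here; other combination rules are possible.
    """
    if weights is None:
        weights = [1.0] * len(scores)
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


tokens = tokenize("Protein A binds protein B and interacts with C.")
print(feature_level_integration(tokens, ["protein", "binds", "cell"]))
# prints [2, 1, 0, 2]
```

In the feature-level variant a single classifier is trained on the concatenated vector, whereas in the classifier-level variant each feature type gets its own classifier and only the output scores are merged.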
Similar resources
Prediction of Protein Sub-Mitochondria Locations Using Protein Interaction Networks
Background: Prediction of protein localization is among the most important issues in bioinformatics; it is used to predict the proteins present in cells and organelles such as mitochondria. In this study, several machine learning algorithms are applied to predict intracellular protein locations. These algorithms use the features extracted from pro...
Generating Fuzzy Rules for Protein Classification
This paper considers the generation of interpretable fuzzy rules for assigning an amino acid sequence to the appropriate protein superfamily. Since the main objective of this classifier is the interpretability of rules, we have used the distribution of amino acids in the sequences of proteins as features. These features are the occurrence probabilities of six exchange groups in the seque...
Protein-Protein Interaction Analysis of Common Top Genes in Obsessive-Compulsive Disorder (OCD) and Schizophrenia: Towards New Drug Approach
Comorbidity is common among psychiatric disorders, and its rate is high between obsessive-compulsive disorder and schizophrenia. Many studies have suggested that the disorders may share the same etiological bases. In this regard, the shared glutamate, dopaminergic, and serotonin pathways are the known ones. Here, the common significant genes are examined to understand the possible molecular origin of the diso...
Construction and Analysis of Tissue-Specific Protein-Protein Interaction Networks in Humans
We have studied the changes in protein-protein interaction network of 38 different tissues of the human body. 123 gene expression samples from these tissues were used to construct human protein-protein interaction network. This network is then pruned using the gene expression samples of each tissue to construct different protein-protein interaction networks corresponding to different studied ti...
Journal title:
- Journal of biomedical informatics
Volume 42, Issue 5
Pages: -
Publication date: 2009